A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization

Authors

  • Qiang Wang
  • Xiaolong Wang
  • Guan Yi
Abstract

This paper proposes applying Latent Semantic Indexing (LSI), computed with the semi-discrete matrix decomposition (SDD) method, to text categorization. The SDD algorithm is a recent approach to LSI that can achieve similar performance at a much lower storage cost. Here, LSI is used for text categorization by constructing new category features as combinations or transformations of the original features. In experiments on a Chinese Library Classification data set, we compare its accuracy with that of a classifier based on k-Nearest Neighbor (k-NN); the results show that k-NN based on LSI is sometimes significantly better. Much future work remains, but the results indicate that LSI is a promising technique for text categorization.
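
The pipeline described in the abstract (project documents into a low-rank latent space, then classify with k-NN) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses NumPy's truncated SVD in place of SDD (SDD instead approximates the term-document matrix with factors whose entries are restricted to -1, 0, and 1, which is where its storage savings come from). The toy matrix, the labels, and the function names `lsi_fit` and `knn_predict` are assumptions made for the example.

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k LSI via truncated SVD (stand-in for SDD; an assumption of this sketch).

    A is a terms x documents matrix. Returns the term basis U_k and the
    k-dimensional representation of every training document (one per row).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    docs_k = A.T @ U_k  # equals V_k * S_k: documents in the latent space
    return U_k, docs_k

def _unit_rows(M):
    """Scale vectors to unit length, leaving all-zero vectors unchanged."""
    norms = np.linalg.norm(M, axis=-1, keepdims=True)
    return M / np.where(norms == 0, 1.0, norms)

def knn_predict(train_vecs, train_labels, query_vec, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar training vectors."""
    sims = _unit_rows(train_vecs) @ _unit_rows(query_vec)
    nearest = np.argsort(-sims)[:n_neighbors]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical toy corpus: 6 terms x 4 documents, two categories.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
], dtype=float)
labels = ["A", "A", "B", "B"]

U_k, docs_k = lsi_fit(A, k=2)

# A new document is folded into the same latent space before classification.
query_a = np.array([1, 0, 1, 0, 0, 0], dtype=float) @ U_k
query_b = np.array([0, 0, 0, 1, 1, 0], dtype=float) @ U_k
pred_a = knn_predict(docs_k, labels, query_a)
pred_b = knn_predict(docs_k, labels, query_b)
print(pred_a, pred_b)
```

Projecting both the training documents and the query with the same basis `U_k` keeps them comparable, so the cosine similarities used by k-NN do not depend on the arbitrary signs of the singular vectors.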


Similar resources

Information Retrieval System in Bahasa Indonesia Using Latent Semantic Indexing and Semi-Discrete Matrix Decomposition

The focus of this paper is exploring the use of Latent Semantic Indexing (LSI) and Semi-Discrete Matrix Decomposition (SDD) in a Bahasa Indonesia Information Retrieval System. The method is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries...


Using Web Searches on Important Words to Create Background Sets for LSI Classification

The world wide web has a wealth of information that is related to almost any text classification task. This paper presents a method for mining the web to improve text classification, by creating a background text set. Our algorithm uses the information gain criterion to create lists of important words for each class of a text categorization problem. It then searches the web on various combinati...


Improving Methods for Single-label Text Categorization

As the volume of information in digital form increases, the use of Text Categorization techniques aimed at finding relevant information becomes more necessary. To improve the quality of the classification, I propose the combination of different classification methods. The results show that k-NN-LSI, the combination of k-NN with LSI, presents an average Accuracy on the five datasets that is highe...


Text Categorization and Information Retrieval Using WordNet Senses

In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained taking into account for a relevant term of a document its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...


Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature

Understanding functional gene relationships is a challenging problem for biological applications. High-throughput technologies such as DNA microarrays have inundated biologists with a wealth of information; however, processing that information remains problematic. To help with this problem, researchers have begun applying text mining techniques to the biological literature. This work extends pr...




Publication date: 2004